ML-Cheat-Sheet

Basic Rules of Differentiation

Basic Rules

  • Constant Rule: $\frac{d}{dx}C = 0$

  • Power Rule: $\frac{d}{dx}x^n = nx^{n-1}$

  • Linear Combination: $\frac{d}{dx}\left[af(x)+bg(x)\right] = af'(x)+bg'(x)$

  • Product Rule: $\frac{d}{dx}\left[f(x)g(x)\right] = f'(x)g(x)+f(x)g'(x)$

  • Quotient Rule: $\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)g(x)-f(x)g'(x)}{[g(x)]^2}$

  • Chain Rule: $\frac{d}{dx}f(g(x)) = f'(g(x))\,g'(x)$

  • Exponential: $\frac{d}{dx}e^x = e^x$ and $\frac{d}{dx}a^x = a^x\ln(a)$

  • Logarithmic: $\frac{d}{dx}\ln(x) = \frac{1}{x}$ and $\frac{d}{dx}\log_a(x) = \frac{1}{x\ln(a)}$
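
As a quick sanity check, here is a minimal SymPy sketch (assuming SymPy is available; the symbol names are illustrative) that verifies a few of these rules symbolically:

```python
import sympy as sp

x, a, n = sp.symbols('x a n', positive=True)
f, g = sp.Function('f'), sp.Function('g')

# Power rule: d/dx x^n = n*x^(n-1)
print(sp.simplify(sp.diff(x**n, x) - n*x**(n-1)))          # 0

# Product rule: d/dx [f(x)g(x)] = f'(x)g(x) + f(x)g'(x)
lhs = sp.diff(f(x)*g(x), x)
rhs = sp.diff(f(x), x)*g(x) + f(x)*sp.diff(g(x), x)
print(sp.simplify(lhs - rhs))                               # 0

# Chain rule example: d/dx ln(a^x) = ln(a)
print(sp.simplify(sp.diff(sp.log(a**x), x)))                # log(a)
```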

Linear Regression

1. Hypothesis

$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$

2. Cost Function

Mean Squared Error (MSE): $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

3. Optimization

  • Gradient Descent: $\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
  • Normal Equation: $\theta = (X^T X)^{-1} X^T y$
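
A minimal NumPy sketch of both optimizers on synthetic data (the data and variable names are illustrative assumptions, not from the source):

```python
import numpy as np

# Synthetic data: m samples, one feature plus a bias column x_0 = 1
rng = np.random.default_rng(0)
m = 100
X = np.c_[np.ones(m), rng.uniform(-1, 1, m)]
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.standard_normal(m)

def cost(theta):
    # J(theta) = 1/(2m) * sum (h_theta(x) - y)^2
    r = X @ theta - y
    return r @ r / (2 * m)

# Batch gradient descent: theta_j := theta_j - alpha * 1/m * sum (h - y) x_j
theta = np.zeros(2)
alpha = 0.5
for _ in range(1000):
    theta -= alpha * X.T @ (X @ theta - y) / m

# Normal equation: theta = (X^T X)^{-1} X^T y
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

print(cost(theta), theta, theta_closed)
```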

Logistic Regression

1. Hypothesis

$h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}$

  • Prediction Rule:
    • Predict $y=1$ if $h_\theta(x) \geq 0.5$, otherwise predict $y=0$.

2. Cost Function

Log Loss: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]$

3. Optimization

  • Gradient Descent: $\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

4. Sigmoid Properties

  • Output: $g(z) \in [0,1]$
  • Derivative: $g'(z) = g(z)\left(1-g(z)\right)$
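
A minimal NumPy sketch tying the pieces together: the sigmoid hypothesis, the log loss, the gradient-descent step, and the 0.5 prediction threshold (the toy data is an illustrative assumption):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}), output in [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    # J(theta) = -1/m * sum [ y log h + (1 - y) log(1 - h) ]
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_step(theta, X, y, alpha):
    # theta_j := theta_j - alpha * 1/m * sum (h - y) x_j  (same form as linear regression)
    h = sigmoid(X @ theta)
    return theta - alpha * X.T @ (h - y) / len(y)

# Toy usage (hypothetical data): predict y = 1 when h_theta(x) >= 0.5
rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.standard_normal(50)]
y = (X[:, 1] > 0).astype(float)
theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.1)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print(log_loss(theta, X, y), (preds == y).mean())
```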

Ridge Regression

Loss Function

Adds L2 regularization to prevent overfitting: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2$

  • $\lambda$: Regularization parameter. Higher values shrink $\theta_j$.

Optimization

  • Closed-form Solution: $\theta = (X^T X + \lambda I)^{-1} X^T y$
  • Gradient Descent: $\theta_j := \theta_j - \alpha\left(\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + 2\lambda\theta_j\right)$
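
A minimal NumPy sketch of both ridge updates on toy data (the data is an illustrative assumption; note that the closed-form solution and the gradient step above use different scalings of $\lambda$, so their results agree only after rescaling $\lambda$):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # theta = (X^T X + lambda I)^{-1} X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def ridge_gradient_step(theta, X, y, lam, alpha):
    # theta_j := theta_j - alpha * (1/m * sum (h - y) x_j + 2 * lambda * theta_j)
    m = len(y)
    grad = X.T @ (X @ theta - y) / m + 2 * lam * theta
    return theta - alpha * grad

# Toy usage (hypothetical data)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.standard_normal(100)

theta_cf = ridge_closed_form(X, y, lam=0.1)
theta_gd = np.zeros(3)
for _ in range(2000):
    theta_gd = ridge_gradient_step(theta_gd, X, y, lam=0.1, alpha=0.1)
print(theta_cf, theta_gd)
```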

Bayesian Classification

Dataset

  • $T = \{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$
  • $x_i = (x_{i1},\dots,x_{in})$, $y_i \in \{c_1,\dots,c_K\}$

Posterior Probability

The probability of class $c_k$ given input $x$: $P(y=c_k \mid x) \propto P(y=c_k)\,P(x \mid y=c_k)$

If the features are conditionally independent given the class: $P(y=c_k \mid x) \propto P(y=c_k)\prod_j P(x_j \mid y=c_k)$
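
A minimal sketch of this counting-based classifier for discrete features (function and variable names are illustrative; no smoothing is applied):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate priors P(y=c_k) and per-feature likelihoods P(x_j = v | y = c_k)
    for discrete features by counting."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    likelihoods = {}
    for c in classes:
        Xc = X[y == c]
        likelihoods[c] = [
            {v: np.mean(Xc[:, j] == v) for v in np.unique(X[:, j])}
            for j in range(X.shape[1])
        ]
    return priors, likelihoods

def predict(x, priors, likelihoods):
    # argmax_k  P(y=c_k) * prod_j P(x_j | y=c_k)   (conditional independence assumed)
    scores = {
        c: priors[c] * np.prod([likelihoods[c][j].get(x[j], 0.0) for j in range(len(x))])
        for c in priors
    }
    return max(scores, key=scores.get)

# Toy usage with two binary features (hypothetical data)
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
priors, likes = fit_naive_bayes(X, y)
print(predict(np.array([1, 1]), priors, likes))
```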

SVM

Hard SVM

  • Hyperplane: $H = \{x \mid w^T x + b = 0\}$
  • Constraint: $y_i(w^T x_i + b) \geq 1 \;\; \forall i$
  • Goal: $\min_{w,b} \frac{1}{2}\|w\|^2$ s.t. $y_i(w^T x_i + b) \geq 1$
  • Lagrangian: $L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\left(y_i(w^T x_i + b) - 1\right)$, $\alpha_i \geq 0$
  • Partial derivatives: $\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0$, $\quad \frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0$
  • Solution: $\|w\|^2 = \left(\sum_i \alpha_i y_i x_i\right)^T\left(\sum_j \alpha_j y_j x_j\right) = \sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i^T x_j$
  • The Lagrangian becomes (dual): $L = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i^T x_j$ s.t. $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0 \;\forall i$
  • Weight vector: $w = \sum_i \alpha_i y_i x_i$
  • Bias: $b = y_j - \sum_i \alpha_i y_i x_i^T x_j$ for any support vector $x_j$
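
A minimal sketch that solves the hard-margin dual numerically with SciPy on a toy linearly separable set (the data and the use of `scipy.optimize.minimize` are illustrative choices, not prescribed by the source):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = (y[:, None] * X) @ (y[:, None] * X).T  # entries y_i y_j x_i^T x_j

def neg_dual(a):
    # Negative dual objective: -(sum a_i - 1/2 sum_ij a_i a_j y_i y_j x_i^T x_j)
    return 0.5 * a @ K @ a - a.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0, None)] * len(y)                    # alpha_i >= 0 (hard margin)

alpha = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=cons).x

w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                   # b = y_k - w^T x_k, averaged over support vectors
print(w, b)
```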

Soft SVM

  • Hyperplane: $H = \{x \mid w^T x + b = 0\}$
  • Constraint: $y_i(w^T x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$, $\forall i$
  • Goal: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$ s.t. $y_i(w^T x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$
  • Lagrangian: $L(w,b,\xi,\alpha,\mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(y_i(w^T x_i + b) - 1 + \xi_i\right) - \sum_{i=1}^{n}\mu_i\xi_i$, $\alpha_i, \mu_i \geq 0$
  • Partial derivatives: $\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n}\alpha_i y_i x_i = 0$, $\quad \frac{\partial L}{\partial b} = -\sum_{i=1}^{n}\alpha_i y_i = 0$, $\quad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$
  • Solution: $\|w\|^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^T x_j$
  • Dual problem: $\max_\alpha \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^T x_j$ s.t. $\sum_{i=1}^{n}\alpha_i y_i = 0$, $0 \leq \alpha_i \leq C$
  • Weight vector: $w = \sum_{i=1}^{n}\alpha_i y_i x_i$
  • Bias: $b = y_k - \sum_{i=1}^{n}\alpha_i y_i x_i^T x_k$ for any $0 < \alpha_k < C$
  • Why $\xi$ disappears: the slack variables $\xi_i$ vanish from the dual because they are handled implicitly through the Lagrange multipliers. Taking the derivative of the Lagrangian with respect to $\xi_i$ gives $\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$, which bounds each multiplier by $0 \leq \alpha_i \leq C$. Consequently the slack variables $\xi_i$ do not appear explicitly in the dual formulation; instead, the dual balances maximizing the margin against allowing misclassification through this box constraint on $\alpha_i$.
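
In practice the soft-margin problem is usually handed to a library; a minimal scikit-learn sketch (assuming scikit-learn is installed) where `C` is the penalty on the slack variables:

```python
import numpy as np
from sklearn.svm import SVC

# Slightly overlapping two-class data (hypothetical example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C trades margin width against the slack penalty C * sum(xi_i)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]                      # w = sum_i alpha_i y_i x_i (available for the linear kernel)
b = clf.intercept_[0]
print(w, b, len(clf.support_))        # support vectors are the points with alpha_i > 0
```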

Kernel SVM

  • Hyperplane: $H = \{x \mid w^T\phi(x) + b = 0\}$
  • Constraint: $y_i(w^T\phi(x_i) + b) \geq 1 - \xi_i$, $\xi_i \geq 0$, $\forall i$
  • Goal: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$ s.t. $y_i(w^T\phi(x_i) + b) \geq 1 - \xi_i$
  • Lagrangian (dual): $L(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i,x_j)$ s.t. $\sum_{i=1}^{n}\alpha_i y_i = 0$, $0 \leq \alpha_i \leq C$, $\forall i$
  • Weight vector: $w = \sum_{i=1}^{n}\alpha_i y_i \phi(x_i)$
  • Decision function: $f(x) = \operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i y_i K(x_i,x) + b\right)$
  • Bias: $b = y_k - \sum_{i=1}^{n}\alpha_i y_i K(x_i,x_k)$ for any support vector with $0 < \alpha_k < C$
  • Kernel functions:
    • Linear: $K(x_i,x_j) = x_i^T x_j$
    • Polynomial: $K(x_i,x_j) = (x_i^T x_j + c)^d$
    • Gaussian (RBF): $K(x_i,x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
    • Sigmoid: $K(x_i,x_j) = \tanh(\kappa\, x_i^T x_j + c)$
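
A minimal NumPy sketch of these kernels and of the kernel decision function (function names and parameter defaults are illustrative assumptions):

```python
import numpy as np

# Each kernel takes two vectors and returns a scalar similarity K(x_i, x_j)
def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, c=1.0, d=3):
    return (xi @ xj + c) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, kappa=1.0, c=0.0):
    return np.tanh(kappa * (xi @ xj) + c)

# Kernel decision function: f(x) = sign( sum_i alpha_i y_i K(x_i, x) + b )
def decision(x, support_X, support_alpha_y, b, kernel=rbf_kernel):
    return np.sign(sum(ay * kernel(xi, x) for xi, ay in zip(support_X, support_alpha_y)) + b)

# Tiny usage example
xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj), rbf_kernel(xi, xj))
```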

MLE and MAP

MLE

1. Build the likelihood function (joint distribution): $L(\theta) = \prod_{i=1}^{n} P(X_i \mid \theta)$
2. Take the logarithm to simplify the computation: $\ln L(\theta) = \sum_{i=1}^{n} \ln P(X_i \mid \theta)$
3. Differentiate and set to zero: $\frac{d}{d\theta}\ln L(\theta) = 0$, then solve for $\hat{\theta}_{MLE}$
4. Verify the extremum: check (e.g. via the second derivative) that it is a maximum.
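
A worked example of these steps for $n$ i.i.d. Bernoulli$(\theta)$ observations (a standard illustration, not taken from the source):

$$L(\theta) = \prod_{i=1}^{n}\theta^{X_i}(1-\theta)^{1-X_i}, \qquad \ln L(\theta) = \Big(\sum_i X_i\Big)\ln\theta + \Big(n - \sum_i X_i\Big)\ln(1-\theta)$$

$$\frac{d}{d\theta}\ln L(\theta) = \frac{\sum_i X_i}{\theta} - \frac{n - \sum_i X_i}{1-\theta} = 0 \;\Rightarrow\; \hat{\theta}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

The second derivative is negative on $(0,1)$, confirming a maximum.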

MAP

1. Combine the prior to build the posterior: $P(\theta \mid X) \propto P(X \mid \theta)\,P(\theta)$
2. Take the log-posterior: $\ln P(\theta \mid X) = \ln P(X \mid \theta) + \ln P(\theta) + \text{const}$
3. Differentiate and set to zero: $\frac{d}{d\theta}\ln P(\theta \mid X) = 0$, then solve for $\hat{\theta}_{MAP}$
4. Verify the extremum: confirm that a maximum has been found.
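
Continuing the Bernoulli example with a Beta$(a,b)$ prior (again an illustration, not from the source):

$$\ln P(\theta \mid X) = \Big(\sum_i X_i + a - 1\Big)\ln\theta + \Big(n - \sum_i X_i + b - 1\Big)\ln(1-\theta) + \text{const}$$

$$\frac{d}{d\theta}\ln P(\theta \mid X) = 0 \;\Rightarrow\; \hat{\theta}_{MAP} = \frac{\sum_i X_i + a - 1}{n + a + b - 2}$$

With a uniform prior ($a = b = 1$) this reduces to the MLE.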